Identifying Impact Factors of Question Quality in Online Health Q&A Communities: an Empirical Analysis on MedHelp
Online health Q&A communities help patients, doctors and other users conveniently search for and share healthcare information online, and they have gained popularity all over the world. Good-quality questions that spark extensive discussion can drive user engagement, which benefits platform operation. However, little attention has been paid to the antecedents of question quality in online health Q&A communities. To investigate healthcare question quality in depth, this research examines impact factors from two aspects neglected in previous research: users' structural influence and questions' sentiment. Using a dataset collected from MedHelp, one of the largest online health Q&A communities, we found that users with high structural influence and questions with negative sentiment are positively associated with the number of answers a question receives. Our research offers meaningful suggestions to platform managers and users.
EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Multilingual and Low Resource Scenarios
Self-Supervised Learning (SSL) models have demonstrated exceptional
performance in various speech tasks, particularly in low-resource and
multilingual domains. Recent works show that fusing SSL models could achieve
superior performance compared to using one SSL model. However, fusion models
have increased model parameter size, leading to longer inference times. In this
paper, we propose a novel approach of predicting other SSL models' features
from a single SSL model, resulting in a light-weight framework with competitive
performance. Our experiments show that SSL feature prediction models outperform
individual SSL models in multilingual speech recognition tasks. The leading
prediction model achieves an average SUPERB score increase of 135.4 in
ML-SUPERB benchmarks. Moreover, our proposed framework offers an efficient
solution, as it reduces the resulting model parameter size and inference times
compared to previous fusion models.
Comment: 7 pages, 2 figures, 7 tables
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
The network architecture of end-to-end (E2E) automatic speech recognition
(ASR) can be classified into several models, including connectionist temporal
classification (CTC), recurrent neural network transducer (RNN-T), attention
mechanism, and non-autoregressive mask-predict models. Since each of these
network architectures has pros and cons, a typical use case is to switch these
separate models depending on the application requirement, resulting in the
increased overhead of maintaining all models. Several methods for integrating
two of these complementary models to mitigate the overhead issue have been
proposed; however, if we integrate more models, we will further benefit from
these complementary models and realize broader applications with a single
system. This paper proposes four-decoder joint modeling (4D) of CTC, attention,
RNN-T, and mask-predict, which has the following three advantages: 1) The four
decoders are jointly trained so that they can be easily switched depending on
the application scenarios. 2) Joint training may bring model regularization and
improve the model robustness thanks to their complementary properties. 3) Novel
one-pass joint decoding methods using CTC, attention, and RNN-T further
improve the performance. The experimental results showed that the proposed
model consistently reduced the WER.
Comment: Accepted by INTERSPEECH202
HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model
Recently, the usefulness of self-supervised representation learning (SSRL)
methods has been confirmed in various downstream tasks. Many of these models,
as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral
features or the model's own representation features. From previous studies, it
is known that the pseudo-labels contain semantic information. However, the
masked prediction task, the learning criterion of HuBERT, focuses on local
contextual information and may not make effective use of global semantic
information such as speaker, theme of speech, and so on. In this paper, we
propose a new approach to enrich the semantic representation of HuBERT. We
apply a topic model to the pseudo-labels to generate a topic label for each
utterance. An auxiliary topic classification task is added to HuBERT by using
topic labels as teachers. This allows additional global semantic information to
be incorporated in an unsupervised manner. Experimental results demonstrate
that our method achieves comparable or better performance than the baseline in
most tasks, including automatic speech recognition and five out of the eight
SUPERB tasks. Moreover, we find that the topic labels capture various information
about an utterance, such as gender, speaker, and theme. This highlights the
effectiveness of our approach in capturing multifaceted semantic nuances.
Comment: Submitted to IEEE ICASSP 202
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Beam search, which is the dominant ASR decoding algorithm for end-to-end
models, generates tree-structured hypotheses. However, recent studies have
shown that decoding with hypothesis merging can achieve a more efficient search
with comparable or better performance. Yet the full context in recurrent
networks is not compatible with hypothesis merging. We propose to use
vector-quantized long short-term memory units (VQ-LSTM) in the prediction
network of RNN transducers. By training the discrete representation jointly
with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN
transducers improve ASR performance over transducers with regular prediction
networks while also producing denser lattices with a very low oracle word error
rate (WER) for the same beam size. Additional language model rescoring
experiments also demonstrate the effectiveness of the proposed lattice
generation scheme.
Comment: Interspeech 2022 accepted paper
Exploration on HuBERT with Multiple Resolutions
Hidden-unit BERT (HuBERT) is a widely used self-supervised learning (SSL)
model in speech processing. However, we argue that its fixed 20 ms resolution
for hidden representations would not be optimal for various speech-processing
tasks since their attributes (e.g., speaker characteristics and semantics) are
based on different time scales. To address this limitation, we propose
utilizing HuBERT representations at multiple resolutions for downstream tasks.
We explore two approaches, namely the parallel and hierarchical approaches, for
integrating HuBERT features with different resolutions. Through experiments, we
demonstrate that HuBERT with multiple resolutions outperforms the original
model. This highlights the potential of utilizing multiple resolutions in SSL
models like HuBERT to capture diverse information from speech signals.
Comment: Accepted to Interspeech202